Kernels and Similarity Measures for Text Classification
Abstract
Measuring the similarity between two strings is a fundamental step in text classification and in other information retrieval problems. Kernel-based methods have recently been proposed for this task: since kernels are inner products in a feature space, they naturally induce similarity measures. Information-theoretic (dis)similarities have also been the subject of recent research. This paper describes several string kernels and information-theoretic measures and shows how they can be implemented efficiently via suffix trees. The performance of these measures is then evaluated on a text classification (authorship attribution) problem involving a set of books by Portuguese writers.
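To illustrate how a kernel induces a similarity measure, here is a minimal sketch in Python. It assumes the k-spectrum kernel (the inner product of k-gram count vectors) as a representative string kernel; the abstract does not say which kernels the paper uses. It also uses plain hash-map counting rather than the suffix-tree implementation the paper describes. The function names `spectrum_features`, `spectrum_kernel`, and `kernel_similarity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def spectrum_features(text: str, k: int) -> Counter:
    """Map a string to its k-spectrum: counts of every contiguous k-gram."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Inner product of the two k-gram count vectors (assumed kernel)."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    # Iterate over the smaller feature map; missing k-grams count as 0.
    if len(fs) > len(ft):
        fs, ft = ft, fs
    return sum(count * ft[gram] for gram, count in fs.items())


def kernel_similarity(s: str, t: str, k: int = 3) -> float:
    """Normalized kernel: the cosine of the angle between feature vectors."""
    kss = spectrum_kernel(s, s, k)
    ktt = spectrum_kernel(t, t, k)
    if kss == 0 or ktt == 0:
        # A string shorter than k has an empty spectrum.
        return 0.0
    return spectrum_kernel(s, t, k) / sqrt(kss * ktt)
```

Normalizing by the self-kernels keeps the score in [0, 1] for count vectors, so texts of different lengths can be compared on the same scale. This version builds an explicit k-gram table for each string, which is the simplest approach but not the efficient one the abstract refers to.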
Similar resources
Hilbertian Metrics and Positive Definite Kernels on Probability Measures
We investigate the problem of defining Hilbertian metrics and, respectively, positive definite kernels on probability measures, continuing the work in [5]. This type of kernel has shown very good results in text classification and has a wide range of possible applications. In this paper we extend the two-parameter family of Hilbertian metrics of Topsøe such that it now includes all commonly used Hilbertian...
String kernels and similarity measures for information retrieval
Measuring a similarity between two strings is a fundamental step in many applications in areas such as text classification and information retrieval. Lately, kernel-based methods have been proposed for this task, both for text and biological sequences. Since kernels are inner products in a feature space, they naturally induce similarity measures. Information-theoretical approaches have also bee...
Interpreting compound nouns with kernel methods
This paper presents a classification-based approach to noun-noun compound interpretation within the statistical learning framework of kernel methods. In this framework, the primary modelling task is to define measures of similarity between data items, formalised as kernel functions. We consider the different sources of information that are useful for understanding compounds and proceed to defin...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Linear-Time Computation of Similarity Measures for Sequential Data
Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams...